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Abstract 

Background: WRKY genes encode one of the most abundant groups of transcription factors in higher plants, and 
its members regulate important biological process such as growth, development, and responses to biotic and 
abiotic stresses. Although the soybean genome sequence has been published, functional studies on soybean genes 
still lag behind those of other species. 

Results: We identified a total of 133 WRKY members in the soybean genome. According to structural features of 
their encoded proteins and to the phylogenetic tree, the soybean WRKY family could be classified into three 
groups (groups I, II, and III). A majority of WRKY genes (76.7%; 102 of 133) were segmentally duplicated and 13.5% 
(18 of 133) of the genes were tandemly duplicated. This pattern was not apparent in Arabidopsis or rice. The 
transcriptome atlas revealed notable differential expression in either transcript abundance or in expression patterns 
under normal growth conditions, which indicated wide functional divergence in this family. Furthermore, some 
critical amino acids were detected using DIVERGE v2.0 in specific comparisons, suggesting that these sites have 
contributed to functional divergence among groups or subgroups. In addition, site model and branch-site model 
analyses of positive Darwinian selection (PDS) showed that different selection regimes could have affected the 
evolution of these groups. Sites with high probabilities of having been under PDS were found in groups I, II c, II e, 
and III. Together, these results contribute to a detailed understanding of the molecular evolution of the WRKY gene 
family in soybean. 

Conclusions: In this work, all the WRKY genes, which were generated mainly through segmental duplication, were 
identified in the soybean genome. Moreover, differential expression and functional divergence of the duplicated 
WRKY genes were two major features of this family throughout their evolutionary history. Positive selection analysis 
revealed that the different groups have different evolutionary rates. Together, these results contribute to a detailed 
understanding of the molecular evolution of the WRKY gene family in soybean. 



Background 

The WRKY family is one of the largest transcription fac- 
tor families in higher plants, but is absent in animals, 
extending throughout the entire green lineage. Recently, 
several WRKY genes were identified from non-plant eu- 
karyotes, including Dictyostelium discoideum, a slime 
mold closely related to the animal and fungi lineages, 
and the green alga Chlamydomonas reinhardtii, an early 
branch of plants. This suggests that WRKY genes may 
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have had an early origin in lower eukaryotes, which have 
since greatly expanded in plant species [1]. Since the 
first WRKY protein, SPF1, was cloned from sweet potato 
[2], more and more WRKY genes have been experimentally 
identified in various plant species [3-19]. Each WRKY pro- 
tein in this family contains at least one WRKY domain of 
approximately 60 amino acids with the conserved amino 
acid sequence WRKYGQK at its N-terminus and a novel 
zinc finger motif, C 2 H 2 (C-X 4 _ 5 -C-X 2 2-23-H-X-H) or 
C 2 HC (C-X 7 -C-X 23 -H-X-C), at the C-terminal region 
[20]. The WRKYGQK amino acid sequence forms a 
(3-strand that facilitates binding to the promoters of target 
genes. Usually, the binding site is the W box, which is an 
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element commonly found in the promoters of many 
stress-related plant genes [21]. WRKY proteins can be 
categorized into three groups based on their number 
of WRKY domains and the pattern of their zinc finger 
motif [22]. The first group contains two WRKY domains 
(N-terminal and C-terminal), including a C 2 H 2 motif, 
whereas the other two groups have only one domain. 
Group III has a distinct zinc finger motif, C 2 HC ra- 
ther than the C 2 H 2 found in other groups. Group II pro- 
teins can be further subdivided into groups II a, II b, II c, 
II d, and II e based on the amino acid motifs contained 
outside the WRKY domain. 

As transcription factors, plant WRKY proteins have 
been shown to be involved in responses to biotic and 
abiotic stresses, and in developmental processes [23]. It 
has been well documented that WRKY proteins play an 
important role in plant defense against biotic stresses, 
such as bacterial, fungal, and viral pathogens [24-27]. 
They are also key components in developmental pro- 
cesses, including embryogenesis [28], senescence [29], 
dormancy [30], trichome development [31], seed devel- 
opment [32], and some signal transduction processes me- 
diated by plant hormones such as gibberellic acid [33], 
abscisic acid (ABA) [34], or salicylic acid [35]. Mean- 
while, increasing evidence has revealed that WRKY pro- 
teins are involved in responses to various abiotic stresses 
[36]. In Arabidopsis, results of a microarray study dem- 
onstrated that the expressions of some WRKY transcripts 
are regulated in response to abiotic stresses, includ- 
ing salinity, drought, and cold [37-39]. In rice, under 
various abiotic and phytohormone treatments, the ex- 
pression of WRKY genes showed significant differences 
[40]. In Poncirus trifoliate, a WRKY gene, P£rWRKY2, 
showed differential responses to cold and drought stresses 
[41], while in soybean, at least nine WRKY genes were 
found to be differentially expressed under abiotic stress 
[42]. Collectively, this evidence indicates that WRKY genes 
play important roles in various physiological processes 
in plants. 

Soybean is one of the most important economic crops 
in the world. Genome and transcriptome sequencing 
of the palaeopolyploid soybean have been completed 
[43,44], In the present study, we searched this genome se- 
quence to identify WRKY proteins, and compared the 
structure of the encoded proteins with those of their puta- 
tive homologous WRKY genes in Arabidopsis. In order to 
investigate tandem duplication events, soybean chromo- 
some sequence information was applied to map WRKY 
transcripts to their corresponding genetic loci on chromo- 
somes. A phylogenetic tree was constructed to evaluate 
evolutionary relationships among WRKY genes in the two 
plant species. In addition, we analyzed the transcriptome 
atlas of WRKY genes in different tissues under normal con- 
ditions, and found notable differential expression between 



groups, which indicated broad functional divergence in this 
family. Positive selection analysis revealed that evolutionary 
rates differed among the different groups. Moreover, evolu- 
tionary patterns of the WRKY gene family were examined 
in Arabidopsis, rice, and soybean. The results indicated that 
WRKY genes in soybean were duplicated mainly through 
segmental duplication, which differed from homologous 
genes in Arabidopsis and rice. These results provide useful 
information for future studies of molecular evolution of 
the WRKY gene family in soybean. 

Results 

Identification and distribution of the WRKY gene family 
in soybean 

In plants, the dicot model organism Arabidopsis is com- 
monly used to predict the function of a gene in a newly 
or partially sequenced organism that has a close phylo- 
genetic relationship to Arabidopsis, such as soybean. 
Moreover, there are at least 72 WRKY family members 
in Arabidopsis, and most of these genes have been exten- 
sively studied and reported to be involved in many physio- 
logical and biochemical processes [20,22]. With the aim of 
defining the soybean protein-containing WRKY domains, 
we downloaded the 72 known Arabidopsis WRKY protein 
sequences from the Arabidopsis transcription factor data- 
base (AtTFDB; http://arabidopsis.med.ohio-state.edu/). In 
order to examine the structural features of each AtWRKY, 
we performed a multiple sequence alignment (data not 
shown). Two members, At4gl2020 and At4g30930, were 
excluded from the analysis due to incomplete zinc fingers 
and the lack of WRKY domains. Therefore, 70 Arabidopsis 
WRKY protein sequences were used to BLAST the com- 
pleted soybean genome sequences for genes that encode 
proteins containing the WRKY domain. The WRKY do- 
main of each predicted protein was identified by searching 
against the SMART database. After manually removing 
overlapping genes, a total of 133 non-redundant genes in 
the soybean genome were identified as members of the 
WRKY family (Additional file 1). Among these members, 
annotation (predicted) of 23 proteins revealed that they 
have two complete WRKY domains each, which all be- 
long to group I. In addition, physical positions of WRKY 
genes were obtained from the Phytozome database, and 
were used to map these genes onto their corresponding 
soybean chromosomes (Figure 1). Results showed that 
WRKY genes in soybean could be mapped on all chro- 
mosomes, from chromosome 1 to 20. Chromosome 8 
had the highest density of WRKY genes with 11 mem- 
bers, while in chromosomes 10, 11, 12, and 20, no more 
than three WRKY genes could be found, respectively. 
Examination of the location of each WRKY gene revealed 
that all GmWRKY genes, except for Glyma02g39870, 
Glyma03g25770, Glyma04g05700, Glyma04gl2830, Glyma 
05g20710, Glyma06g06530, Glyma06gl3090, Glyma06g 
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Figure 1 Chromosome distribution of soybean {Glycine max) WRKY genes. The size of a chromosome is indicated by its relative 
length. Red outlined boxes represent segmentally duplicated genes. Tandem duplicated genes are indicated with vertical green lines. 
The location information and chromosome information were obtained from Phytozome. The figure was produced using the 
Maplnspector program. 



27440, Glyrna06g47880, Glyma08g01430, Glyma09g09400, 
Glyma09g24080, Glyma09g37930, Glyrnallg29720, Glyma 
12g239S0, Glymal3g34280, Glyrnal4gl7330, Glymal4g 
36430, Glymal4g37960, Glyma 13g20990, Glymal6gOS880, 
Glymal7g29190, Glyma 18g39970, Glyma 18g49 140, Glyma 
19g02440, Glymal9g26400, and Glyma20g03410, origi- 
nated from segmental duplications (102 of 133) or tan- 
dem duplications (18 of 133) (Figure 1). The 27 genes 
mentioned above might have been produced by retro- 
transposition instead. 

Multiple sequence alignment and structure analysis 

The phylogenetic relationship of GmWRKY proteins 
was examined by multiple sequence alignment of their 
WRKY domains, which span across approximately 60 
amino acids (Figure 2). A comparison with soybean 
WRKY domains and several homologous Arabidopsis 
proteins resulted in a separation of the different groups 
and subgroups. For each group or subgroup, one 
Arabidopsis protein was selected randomly, which in- 
cluded At2g04880C, with only one C-terminal WRKY 
domain, At4g26440N, with only one N-terminal WRKY 
domain, Atlg80840, Atlgl8860, Atlg69310, At2g30590, 
Atlg29280, and At2g46400. As shown in Figure 2, the 



sequences of soybean WRKY were found to be highly 
conserved. 

In order to better separate the groups and examine the 
evolutionary relationships of the WRKY family in soy- 
bean and Arabidopsis, an unrooted phylogenetic tree was 
constructed from alignments of their domain protein se- 
quences, which resulted in the formation of three distinct 
clusters: group I, group II, and group III (Figure 3). WRKY 
proteins of Arabidopsis and soybean were present in all 
clusters. This classification was consistent with results of 
Rushton et al. [20], who suggested that WRKY domains 
could be classified into three large groups corresponding 
to groups I, II, and III of Arabidopsis. Notably, AtWRKY 
members were more similar to those in the same class in 
divergent species than they were to other WRKY proteins 
in the same species. In order to examine the syntenic rela- 
tionship of the WRKY gene family between the genomes 
of soybean and Arabidopsis, each WRKY gene within the 
family in Arabidopsis was searched in the PGDD (http:// 
chibba.agtec.uga.edu/duplication/) (data no shown). During 
the course of this analysis we found that synteny was rela- 
tively well conserved between soybean and Arabidopsis 
proteins. For example, in subgroup II a (Figure 3), several 
GmWRKYs (Glyma06g06S30, Glyma07g02630, Glyma 
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Figure 2 Alignment of multiple soybean WRKY genes and selected /AfWRKY domain amino acid sequences. Alignment was performed 
using Clustal W. The suffixes 'N' or 'C denote the N-terminal and C-terminal WRKY domains from Group I WRKY proteins, respectively. The amino 
acids forming the zinc-finger motif are highlighted in yellow. The conserved WRKY amino acid signature is highlighted in blue. The four (3-strands 
are shown in red. The position of a conserved intron is indicated by an arrowhead. 



08g23380 Glyrnal3g44730, and GlymalSg00S70) lo- 
cated on different chromosomes are orthologs of a same 
AtWRKY gene (Atlg80840). Additionally, it is worth noting 
that the structure and phylogenetic tree of the GmWRKY 
domain clearly indicated that group II proteins could be di- 
vided into five distinct subgroups (II a-e). 

The phylogenetic classification was found to be con- 
sistent with the motif composition among group or 



subgroup. Differences between groups or subgroups 
were observed in not only the type of motifs in one 
WRKY protein, but also in the motif number in one 
WRKY protein. As displayed schematically in Additional 
file 2 and Additional file 3, nine types of motifs were 
detected, including three types of WRKY motifs. The 
majority of the proteins of subgroups I (73.9%; 17 of 23), 
II c (93.8%; 30 of 32) and II d (93.3%; 14 of 15), together 
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Figure 3 Phylogenetic tree of WRKY domains among soybean and Arabidopsis. The amino acid sequences of the WRKY domain of soybean 
and Arabidopsis were aligned with Clustal W and the phylogenetic tree was constructed using the maximum likelihood method in MEGA 5.0. 
Group 1 proteins with the suffix 'N' or 'C indicate the N-terminal WRKY domains or the C-terminal WRKY domains, respectively. Genes with similar 
functions clustered together are indicated by filled green circles. Gene expansion in soybean and Arabidopsis are indicated by coloring the 
subclade with the same color as the leaf label. The red arcs indicate different groups (or subgroups) of WRKY domains. 

V ) 



with those of group III (87.5%; 14 of 16), share a unique 
WRKY motif, which is shown in red color in Additional 
file 3. Subgroups II a and II b have the same motif com- 
ponents, suggesting a close phylogenetic relationship. 
The motif number in each WRKY protein ranged from 
two to six, and this difference is apparent in groups or 
subgroups of the WRKY family. For example, all mem- 
bers of group III and the majority of subgroups II e and 
II d members have two motifs, including a WRKY motif. 



Interestingly, the relative motif positions in different 
groups or subgroups also vary significantly. Therefore, 
motif composition can shed light on phylogenetic rela- 
tionships of the WRKY family. 

Comparison of full-length cDNA sequences with cor- 
responding genomic DNA sequences suggested that the 
exon number of soybean WRKY genes ranged from two 
to eight. The results of intron/exon structure identifica- 
tion (Additional file 4) showed that most of the WRKY 
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domain-containing protein genes in different groups or 
subgroups have very conservative exon numbers. Mem- 
bers of group II d have three exons, and 14 of the 16 genes 
in group III have three exons. Notably, except for their dif- 
ferences in exon numbers, the relative exon positions in 
different groups or subgroups also vary significantly. The 
exon/intron analysis showed clear differences in both exon 
positions and exon numbers across the different groups or 
subgroups. 

Together, these results indicated that soybean WRKY 
domains could be classified into three large groups: 
group I, group II, including group II a-e, and group III. 
Basic information of all soybean WRKY family members, 
including conserved heptapeptide, zinc-finger type, do- 
main number, group, coding sequence (CDS) length, and 
gene length, is provided in Additional file 1. 

Transcriptome atlas and promoter analysis 

Since the transcriptome sequencing of soybean was 
completed, the availability of the soybean gene expres- 
sion atlas facilitates additional studies on the basic biol- 
ogy of soybean [44]. The recently developed RNA-Seq 
web-based tools, which include gene expression data 
across multiple tissues and organs, allow for charac- 
terization and comparisons of the gene transcriptome 
atlas in soybean. Consequently, distinct transcript abun- 
dance patterns are readily identifiable in the RNA-Seq 
atlas data set for all 133 GmWRKYs. Similar to other 
genes that encode transcription factors, many of the 
GmWRKYs exhibited low transcript abundance levels, as 
determined by the RNA-Seq atlas analysis. Furthermore, 
we observed that most of the genes had very broad ex- 
pression spectra. However, six GmWRKYs, including 
Glyma04gOS700 1 Glyma04g39650, Glyrna04g40120, Glyma 
08g01430, GlymaOlgOSOSO, and Glyma 14gl 1440, lacked 
expression information, which possibly indicated that these 
were pseudogenes or were expressed only at specific devel- 
opmental stages or under special conditions. We observed 
that accumulation of WRKY gene transcripts was associ- 
ated with different tissues, and expression patterns differed 
among each WRKY gene member (Figure 4). In soybean, 
33.1% (44 of 133) of the analyzed WRKYs were constitu- 
tively expressed in all of the seven tissues tested, which 
suggested that GmWRKYs play regulatory roles at mul- 
tiple developmental stages. By contrast, most GmWRKYs 
showed preferential expression. RNA-Seq atlas data re- 
vealed that the majority (92 of 133; 69.2%) of GmWRKYs 
exhibited transcript abundance profiles with marked peaks 
in only a single tissue. This result suggests that these 
WRKY proteins function as tissue-specific regulators and 
are limited to discrete cells or organs. Approximately 45 of 
these 133 (33.8%) GmWRKYs showed the highest tran- 
script accumulation level in root tissue, 20 (15.0%) showed 
the highest transcript accumulation in flower tissue, 13 



(9.8%) showed the highest transcript accumulation level in 
nodule tissue, and surprisingly, only one showed the 
highest transcript accumulation level in seed tissue. The 
wide expression of these genes suggests that WRKY genes 
from soybean are involved in the development of all organs 
or tissues under normal conditions. In addition to groups 
of genes that exhibited similar transcript abundance pro- 
files but were relatively phylogenetically distinct, several 
phylogenetic clades shared the same transcript abundance 
profile to a large extent. For example, in subgroup II b, 
most of the GmWRKYs were preferentially expressed in 
root tissue. Interestingly, in Arabidopsis, according to the 
experimental results of Dong et al. [45], more than half of 
the members in subgroup II b showed similar preferential 
expression in the root tissue under normal conditions, 
which indicated the conserved functional role of subgroup 
II b in root development between the two species. The ex- 
pression of members of group I in soybean was detectable 
in flower tissue, which suggested their conserved roles in 
flower formation. Members of group I also showed similar 
expression patterns in nodule and root development. Fur- 
thermore, GmWRKYs with high sequence similarity and 
shared expression profiles represent good candidates for 
evaluation of gene functions in soybean. The transcriptome 
atlas indicted that differential expression was extended to 
all groups or subgroups of the soybean WRKY gene family, 
which was further verified by the promoter analysis. Be- 
cause transcription factors bind to corresponding tran- 
scription factor binding sites (TFBSs) upstream of genes of 
interest, profiles of ds-acting elements may thus provide 
useful information related to the regulatory mechanism of 
gene expression. A computational tool, PlantCARE [46], 
was adopted to identify putative TFBSs in the 1500-bp 
DNA sequence upstream of the translation initiation 
codon of WRKY genes in soybean. Four types of ds-acting 
elements were found to be significantly abundant in the 
promoter region of Gm WRKY genes (Additional file 5). 
The first type of ds-acting element enriched in the pro- 
moter region is the light responsive elements, such as 
G-box [47,48], G AG-motif [49], Box I [50], and Box 4 
[51], etc. The G-box element appears to be more abundant 
in subgroup II a, in which each member contains at least 
two copies. In addition, the mean number of G-box copies 
was 3.625 in subgroup II a, which is higher than that in 
other subgroups or in the whole WRKY gene family. This 
result indicates that the G-box element seems to be 
enriched in subgroup II a. All but seven (94.7%; 126 of 133, 
Group I; 2, Group II b; 2, Group II c; 1, Group II d; 1, and 
Glyma 14g3 7960; 1) have at least one Box 4 element copy. 
Its mean number of copies (3.744) in the whole WRKY 
gene family was apparently higher than that of other types 
of ds-acting elements except for TATA-box, CAAT-box, 
and unnamed-4. This result suggests that the Box 4 elem- 
ent tends to be enriched in the soybean WRKY gene 
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Figure 4 Expression profiles of 127 soybean WRKY genes. The hierarchical cluster color code: the largest values are displayed as the reddest (hot), 
the smallest values are displayed as the bluest (cool), and the intermediate values are a lighter color of either blue or red. Pearson correlation clustering 
was used to group the developmentally regulated genes. Six genes were excluded from the analysis due to no expression in an organ or a period. 
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family. As one part of a conserved DNA module involved 
in light responsiveness, previous studies showed that the 
Box 4 element is frequently found in promoter regions of 
different genes from various species [51,52]. However, it is 
noteworthy that the Box 4 element was found in high 
frequency in the soybean WRKY gene family, which sug- 
gests that the Box 4 element may be important for light- 
controlled transcriptional activity [53]. Plant hormone 
responsive elements, such as ABRE [54], P-box [55], as well 
as the TCA-element [56], constitute the second class. 
ABRE (51.9%; 69 of 133) appears to be one of the most 
abundant hormone-related ds-acting element in soybean, 
suggesting that abscisic acid (ABA) regulates the expres- 
sion of some GmWRKYs, whereas such elements were 
rarely detected in Group II d (20.0%; 3 of 15). By contrast, 
the salicylic acid responsive TCA-element was frequently 
found in groups or subgroups. These observations suggest 
that GmWRKYs in different groups are likely to be signifi- 
cantly regulated by different types of hormones. The third 
most abundant ds-acting element class consisted of ele- 
ments that respond to external environmental stresses. We 
observed that most of the GmWRKYs examined appeared 
to contain MBS (72.2%; 96 of 133) [57], heat shock element 
(HSE) (77.4%; 103 of 133) [58], and TC-rich repeat ele- 
ments (71.4%; 95 of 133) [59]. MBS is an element involved 
in drought induction, and HSE is also enriched in the pro- 
moter. With a few exceptions, GmWRKYs contain more 
than two copies of this element. Circadian elements, which 
are involved in circadian control [60], is the fourth type 
of ds-acting element that was abundantly found in the 
promoter regions of soybean WRKY genes. PlantCARE 
identified 98 (73.7%; 98 of 133) GmWRKY genes con- 
taining circadian elements, which may be responsible 
for its distinct diurnal expression pattern. The presence 
of a diversity of ds-acting elements in the upstream regions 
of GmWRKYs indicates that GmWRKYs may function in a 
relatively wide range of activities. 

The above results indicated that the 133 WRKY genes 
in soybean display differential expression, either in their 
transcript abundance or in their expression patterns 
under normal growth conditions in different groups or 
subgroups. 

Detection of positive selection and functional divergence 
analysis (FDA) 

Site models and branch-site models in PAML [61] were 
used to detect positive selection in the WRKY gene family 
of soybean. Substitution rate ratios of non-synonymous 
(dN or Ka) versus synonymous (dS or Ks) mutations 
(dN/dS or co) were calculated. A Ka/Ks ratio of 1 indicates 
genes that are subject to neutral selection, <1 indicates 
genes subject to negative selection, and >1 indicates genes 
subject to positive selection [62]. Additional file 6 lists par- 
ameter estimates and log-likelihood values for each site 



model. Two pairs of models (M0/M3 and M7/M8) were 
selected and compared. The site-homogeneous model, M0 
(one-ratio), assumes one co for all sites, whereas M3 
(discrete) assumes a general discrete distribution. Two 
other models used were M7 (beta), which assumes a beta 
distribution of co that is limited to the range (0, 1), and M8 
(beta & co), which adds an extra site class with co estimated 
[63]. In addition, to test for variable omega ratios among 
lineages, we conducted the likelihood ratio test (LRT) to 
compare the two extreme models. The log likelihood 
values under the one-ratio model and the discrete model 
were determined to be -8608.409 and -8272.457, respect- 
ively. Twice the log likelihood difference value, 2AlnL = 
335.95, was found to be strongly statistically significant 
(p < 0.01), thus revealing a heterogeneous selective pres- 
sure among lineages. Moreover, the log likelihood value 
under the beta model and the beta & co were -8253.022 
and -10197.202, respectively. Twice the log likelihood dif- 
ference, 2AlnL = 1944.18, was also strongly statistically 
significant (p < 0.01). The comparison of M3 versus M0 re- 
vealed that none of the codon sites appeared to be under 
the influence of positive selection (co > 1). By contrast, 
comparing the M7 model to the M8 model indicated 
that -0.001% of codons fell within an estimated co value of 
2.638, suggesting positive selection. We also used Bayes 
empirical Bayes (BEB) estimation methods in model M8 
[64] to identify sites under positive selection. We found 
only one positive selection site at the 0.05 significance level, 
and three sites at the 0.01 significance level. Together, these 
results indicate that no strong positive selection sites could 
be detected under the site model in the soybean WRKY 
gene family. 

Branch-site models allow co to vary both among sites 
in the protein and across branches on the tree, and aim 
to detect positive selection affecting a few sites along 
particular lineages [64]. The branches being tested for 
positive selection are referred to as the foreground 
branches, and all other branches on the tree are referred 
to as background branches. The BEB method was im- 
plemented to calculate posterior probabilities (Qks) for 
site classes if the LRT suggested the presence of codons 
under positive selection on the foreground branch [65]. 
In our study, group I, group II a-e, and group III were se- 
lected as foreground branches, respectively, while the other 
groups were selected as the background branches. It is not- 
able that no positive selection sites were observed in groups 
II a, II b, or II d. In contrast, positive selection sites detected 
by the branch-site model (Table 1) were distributed in 
groups II e and III at the 0.01 significance level. This result 
demonstrated that the groups have different evolutionary 
rates. Group II e and group III appeared to be confronted 
with strong positive Darwinian selection, as many highly 
significant positive sites were present, whereas evolution in 
the other groups appeared to be more conservative. 
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Table 1 Parameters estimation and likelihood ratio tests for the branch-site models 



Branch-site model 



Foreground 
branches 




Estimates of parameter 




Positive selection sites (BEB) ' 


Site class 1 0 


Site class 1 


Site class 2a 


Site class 2b 


Group 1 


P 0 = 0.75959 


Pt =0.05958 


P 2a = 0.1 6768 


P 2b = 0.01315 


261G*,262S*,282H*, 




00o(b) = 0.05978 


oo 1( b)= 1.00000 


oo 2a(b ) = 0.05978 


oo 2b (b) = 1 -00000 


288D*, 292 M* 




oo 0( f) 3 = 0.05978 


oo 1(f )= 1.00000 


oo 2a(f) = 232.95 169 


oo 2b (f) = 232.95 169 




Group 2a 


P 0 = 0.00000 


Pt =0.00000 


P 2a = 0.92727 


P 2b = 0.07273 


None 




oo 0( b) = 0.06077 


oo 1( b)= 1.00000 


oo 2a(b ) = 0.06077 


oo 2b (b) = 1 -00000 






oo 0 (f) = 0.06077 


oo 1(f )= 1.00000 


^2a(f) = 999.00000 


co 2b(f) = 999.00000 




Group 2b 


P 0 = 0.00001 


Pt =0.00000 


P 2a = 0.92726 


P 2b = 0.07273 


None 




oo 0 (b) = 0.06039 


OJ 1(b) = 1.00000 


oo 2a(b ) = 0.06039 


a) 2b (b) = 1 -00000 






oo 0 (f) = 0.06039 


oo 1(f )= 1.00000 


W2a(f) = 1 -00000 


oo 2b (f) = 1 .00000 




Group 2c 


P 0 = 0.80507 


Pt =0.06312 


P 2a = 0.1 2223 


P 2b = 0.00958 


261G**,275R** 




oo 0( b) = 0.06042 


oo 1(b )= 1.00000 


oo 2a(b) = 0.06042 


^2b(b) = 1 -00000 






oo 0 (f) = 0.06042 


oo 1(f) = 1.00000 


oo 2a(f) = 167.23585 


oo 2b (f)= 167.23585 




Group 2d 


P 0 = 0.00000 


Pt =0.00000 


P 2a = 0.92727 


P 2b = 0.07273 


None 




oo 0( b) = 0.061 18 


(jdi( b ) = 1.00000 


oo 2a(b ) = 0.061 18 


oo 2b (b) = 1 -00000 






oo 0 (f) = 0.061 18 


(jj 1(f) = 1.00000 


oo 2a(f) = 981.94932 


w 2 b(f) = 981.94932 




Group 2e 


P 0 = 0.76730 


Pt =0.06008 


P 2a = 0.1 6008 


P 2b = 0.01 253 


248E** # 249Y** # 286A*, 




oo 0 (b) = 0.06061 


oo 1( b)= 1.00000 


oo 2a (b) = 0.06061 


oo 2b (b) = 1 -00000 


288D**, 298E**, 299G** 




oo 0 (f) = 0.06061 


OJ 1(f) = 1.00000 


w 2a (f) = 999.00000 


^2b(f) = 999.00000 




Group 3 


P 0 = 0.63465 


Pt =0.04978 


P 2a = 0.29262 


P 2b = 0.02295 


258P*, 260 K** ; 263P**, 275R*, 




oo 0( b) = 0.061 76 


(jJi ( b) = 1.00000 


oo 2a(b ) = 0.061 76 


oo 2b (b) = 1 -00000 


293 L* ; 294I**, 298E**, 303H** 




O)o(f) = 0-061 76 


oo 1(f )= 1.00000 


^2a(f) = 999.00000 


oo 2b(f) = 999.00000 





Note: * p < 0.05 and ** p < 0.01 (x 2 test). 

1 The sites in the sequence evolve according to the same process, the transition probability matrix is calculated only once for all sites for each branch. 

2 Background w. 

3 Foreground co. 

4 The number of amino acid sites estimated to have undergone positive selection; BEB: Bayes Empirical Bayes. 



Type I functional divergence (shifted evolutionary rate) 
and Type II functional divergence (altered amino acid 
physicochemical property) between gene clusters of the 
WRKY gene family were estimated by posterior analysis 
using the program DIVERGE v2.0 [66,67] . Because these 
methods are not sensitive to saturation of synonymous 
sites, they have been extensively applied in research of 
various gene families [68-70]. The estimation was based 
on the WRKY protein neighbor-joining tree, in which 
seven major clades were clearly present. Pairwise com- 
parisons of paralogous WRKY genes from group I, group 
II a-e, and group III were carried out, and the rate of 
amino acid evolution at each sequence position was esti- 
mated. Our results (Additional file 7) indicated that with 
nine exceptions (group pairs II d/II e, II d/III, II d/II a, 
II d/I, II e/III, III/II a, and II b/I), the coefficients of 
Type-I functional divergence (9 M l) between WRKY 
groups were moderately statistically significant (p < 0.05), 
with 6 ML values ranging from 0.201 to 0.395, or strongly 
statistically significant (p < 0.01) with 6 ML values ranging 



from 0.311 to 0.618. These results indicated significant 
site-specific altered selective constraints on some members 
of the WRKY family, leading to subgroup-specific func- 
tional evolution after diversification. Additionally, Type-I 
functional divergence was not evident in the comparison 
of group II d with the other four groups, which suggests 
that group II d may be the most conservative clade. Type- 
II functional divergence was evident in all groups or sub- 
groups (Additional file 8), with 6-II values ranging from 
0.033 to 0.288 (p < 0.05), indicating a radical shift of amino 
acid properties. These results suggest that the relative im- 
portance of Type-I and Type-II functional divergence 
might be associated with specific functional classes of the 
WRKY gene family in soybean. 

Furthermore, some critical amino acid residues respon- 
sible for functional divergence were predicted with suitable 
cut-off values derived from the Qk of each comparison. In 
order to reduce false positives, Qk > 0.8 was used as the 
cutoff to identify Type-I and Type-II functional divergence- 
related residues in all comparisons between the seven 
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WRKY groups or subgroups. Results showed distinct differ- 
ences in the number and distribution of predicted sites for 
functional divergence within each pairwise comparison. 
However, some critical amino acid sites still showed evi- 
dence of both Type-I and Type-II functional divergence in 
corresponding pairs. For example, five critical residues were 
predicted for the subgroup II e/II c (248E, 258P, 264Y, 
275R, 295 V) and II e/II b (248E, 258P, 275R, 276G, 298E) 
pairwise comparisons, whereas three critical amino acids 
sites were predicted for the subgroup II e/I (248E, 264Y, 
275R) and III/II b (264Y, 295 V, 298E) pairwise compari- 
sons, respectively. Similar cases were found in other sub- 
group pairwise comparisons. Shifted evolutionary rates and 
altered amino acid physicochemical properties co-occurred 
at the same amino acid sites, revealing that these sites have 
played important roles in functional divergence during the 
process of evolution. 

Expansion pattern of the WRKY family in soybean 

Gene duplication events are important for gene family 
evolution, because duplicated genes provide the raw ma- 
terials for the generation of new genes, which in turn fa- 
cilitate the generation of new functions. Three principal 
evolutionary patterns were attributed to gene dupli- 
cations: segmental duplication, tandem duplication, and 
transposition events such as retroposition and replicative 
transposition [71]. Among these, segmental duplication 
occurs most frequently in plants because most plants are 
diploidized polyploids and retain numerous duplicated 
chromosomal blocks within their genomes [72]. Previous 
studies reported several rounds of whole-genome dupli- 
cation (WGD) in both the Arabidopsis and rice genomes 
[73,74]. The occurrence of large-scale gene duplication 
events was also demonstrated in soybean [43]. For this 
analysis, we focused on the tandem and segmental dupli- 
cation modes. Tandem duplications were characterized 
as multiple members of one family occurring within the 
same intergenic region or in neighboring intergenic re- 
gions, where genes were clustered together with a max- 
imum of 10 extra genes between them [40]. We searched 
for contiguous WRKY genes in both the sharing region 
and neighboring regions. Eighteen of the 133 genes (13.5%) 
in this family were found to be located as tandem repeats 
in soybean (Figure 1), indicating that tandem duplications 
contributed to the expansion of this family. We also tested 
the hypothesis that segmental duplication events played a 
large role in the evolution of the WRKY gene family in soy- 
bean. For each WRKY gene, we tallied the number of 
flanking protein-coding genes with a best non-self match 
to a protein-coding gene neighboring its paralog. Ac- 
cording to our results, 76.7% (102 of 133) of genes showed 
high conservation, indicating that these WRKY genes were 
formed through segmental duplication in soybean (Table 2). 
Intriguingly, comparison of the 102 segmental duplicated 



genes in our study to the results of Du et al. [75] suggested 
that 91 (89.2%; 91 of 102) WRKY genes originated from 
WGDs, and the duplication status of the remaining 11 
(10.7%; 11 of 102) WRKY genes, including GlymaOlgOSOSO, 
Glyrna01g43420, Glyrna02gl5920, Glyrna02g45530, Glyma 
03g00460, Glyma03g31630, Glyrna08gl5210, GlymalOg 
03820, Glyrnal3g38630, Glyrnal8gl6170, and Glyma 
18g44030, was singleton, which indicated that these seg- 
mental duplication genes may be derived from independ- 
ent duplication events. These results indicated that most of 
the WRKY genes in soybean were retained after WGDs. 
Edger et al. [76] stated that dosage-sensitive genes, includ- 
ing transcription factors, were preferentially retained fol- 
lowing WGDs, which is compatible with the present study. 
We did not find evidence that other pairs of paralogous 
genes in soybean originated from segmental duplication. 
These results indicate that the soybean WRKY family arose 
mainly through segmental duplications. 

We also used Ks, as the proxy for time, and the con- 
served flanking protein-coding genes to estimate the 
dates of the segmental duplication events. The mean Ks 
values and the estimated dates for all segmental duplica- 
tion events corresponding to WRKY genes are listed in 
Table 2. The segmental duplicated events in soybean ap- 
pear to have occurred recently, and focus on two periods, 
10-20 mya and 40-60 mya, which is consistent with the 
ages of the genome duplication events [43]. Taken together, 
our analyses suggested that segmental duplication is the 
main mechanism for expansion of this WRKY gene family, 
accompanied by tandem duplications. 

Discussion 

Identification, classification, and phylogenetic analysis 
of the soybean WRKY gene family 

The genome sequence and transcriptome profiles of soy- 
bean provide a large amount of useful data to explore 
functional diversity from multiple perspectives. In this 
study, we identified 133 WRKY members in the soybean 
genome. A phylogenetic tree including 133 distinct pro- 
tein sequences clearly demonstrated that these genes 
could be divided into three groups. This classification 
was further supported by the results of motifs and exon/ 
intron analyses. The topology of our phylogenetic tree 
constructed from WRKY genes of two species (soybean 
and Arabidopsis) is largely consistent with that derived 
from Arabidopsis alone. All of the evidence obtained sug- 
gested that the classification was reasonable and reliable. 

WRKY transcription factors have their evolutionary 
origin in ancient eukaryotes, whose genomes contain a 
single WRKY gene with two WRKY domains. The pres- 
ence of a group I WRKY protein in these ancient organ- 
isms suggests that group I WRKY genes represent the 
ancestral form, with other groups arising later through 
losses and gains of WRKY domains [22]. In the present 
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Table 2 Estimates of the dates for the segmental duplication events of WRKY family in soybean 



Segment pairs 




Number of anchors 


KS (mean ± s.d.) 


Estimated tii 


Glyma01g05050 6 


StGlyma18g16170 


4 


0.60 ± 0.20 


49 


Glyma01g06550 £ 


St Glyma02g 12490 


3 


0.1 7 ±0.06 


14 


Glyma01g06870 6 


StGlyma02g12830 


4 


0.1 6 ±0.08 


13 


Glyma01g31920 6 


St Glyma03g05220 


5 


0.1 9 ±0.06 


16 


Glyma01g31920 6 


StGlyma18g44030 


3 


0.71 ±0.22 


58 


Glyma01g39600 £ 


StGlyma11g05650 


17 


0.1 7 ±0.07 


14 


Glyma01g43420 6 


St Glyma05g36970 


5 


0.70 ±0.1 9 


57 


Glyma01g43420 6 


S< Glyma08g02580 


4 


0.68 ±0.1 7 


56 


Glyma02g01420 6 


S< Glyma03g37940 


8 


0.67 ±0.11 


55 


Glyma02g01420 6 


StGlyma10g01450 


19 


0.20 ±0.11 


16 


Glyma02g01420 6 


StGlyma19g40560 


7 


0.72 ±0.1 7 


59 


Glyma02g15920 6 


StGlyma03g31630 


6 


0.60 ±0.1 6 


49 


Glyma02g15920 6 


StGlyma10g03820 


11 


0.1 5 ±0.11 


12 


Glyma02g36510 6 


StGlyma17g08170 


18 


0.1 3 ±0.05 


11 


Glyma02g45530 6 


StGlyma14g03280 


6 


0.1 2 ±0.05 


10 


Glyma02g46690 6 


St Glyma08g43770 


8 


0.61 ±0.19 


50 


Glyma02g46690 6 


StGlyma14g01980 


16 


0.1 3 ±0.07 


11 


Glyma02g47650 6 


StGlyma14g01010 


21 


0.1 4 ±0.08 


11 


Glyma03g00460 6 


StGlyma09g41050 


5 


0.51 ±0.12 


42 


Glyma03g31630 6 


StGlyma10g03820 


6 


0.55 ±0.10 


45 


Glyma03g33380 6 


StGlyma19g36100 


21 


0.1 9 ±0.1 7 


16 


Glyma03g37870 6 


StGlyma19g40470 


18 


0.1 5 ±0.07 


12 


Glyma03g37940 6 


StGlyma10g01450 


9 


0.73 ±0.1 6 


60 


Glyma03g37940 6 


StGlyma19g40560 


17 


0.1 5 ±0.07 


12 


Glyma03g38360 6 


StGlyma19g40950 


11 


0.1 6 ±0.09 


13 


Glyma03g41750 6 


St Glyma07g06320 


5 


0.66 ± 0.23 


54 


Glyma03g41750 6 


StGlyma16g02960 


3 


0.62 ± 0.26 


51 


Glyma03g41750 6 


StGlyma19g44380 


19 


0.1 8 ±0.1 3 


15 


Glyma04g08060 6 


StGlyma06g08120 


14 


0.1 8 ±0.1 2 


15 


Glyma04g39620 6 


St Glyma06g 15260 


14 


0.23 ±0.1 9 


19 


Glyma04g39650 6 


StGlyma05g31800 


4 


0.72 ±0.1 9 


59 


Glyma04g39650 6 


St Glyma06g 15220 


15 


0.22 ±0.1 9 


18 


Glyma04g39650 6 


St Glyma08g 15050 


4 


0.66 ±0.1 9 


54 


Glyma04g40130 6 


St Glyma06g 14720 


18 


0.25 ± 0.25 


20 


Glyma05g01280 6 


St Glyma06g20300 


6 


0.57 ± 0.21 


47 


Glyma05g25330 6 


St Glyma08g08340 


12 


0.21 ±0.19 


17 


Glyma05g25770 6 


St Glyma08g08720 


14 


0.20 ±0.1 5 


16 


Glyma05g29310 6 


St Glyma08g 12460 


18 


0.1 6 ±0.07 


13 


Glyma05g31800 £ 


StGlyma06g15220 


4 


0.63 ± 0.22 


52 


Glyma05g31800 £ 


StGlyma08g15050 


19 


0.1 4 ±0.08 


11 


Glyma05g36970 6 


St Glyma08g02580 


17 


0.22 ±0.1 5 


18 


Glyma05g37390 6 


StGlyma08g02160 


17 


0.1 4 ±0.08 


11 
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Table 2 Estimates of the dates for the segmental duplication events of WRKY family in soybean (Continued) 



Glyma06g15220 6 


StGlyma08g15050 


5 


0.68 ± 0.23 


56 


Glyma06g15260 6 


StGlyma08g15210 


3 


0.71 ±0.21 


58 


Glyma06g46420 6 


StGlyma12g10350 


9 


0.23 ±0.1 9 


19 


Glyma06g46420 6 


StGlyma13g38630 


6 


0.72 ±0.11 


59 


Glyma07g02630 6 


S< Glyma08g23380 


15 


0.22 ±0.1 8 


18 


Glyma07g02630 6 


StGlyma13g44730 


6 


0.56 ±0.1 8 


46 


Glyma07g02630 6 


StGlyma15g00570 


6 


0.54 ±0.1 5 


44 


Glyma07g06320 6 


StGlyma16g02960 


12 


0.1 6 ±0.07 


13 


Glyma07g06320 6 


StGlyma19g44380 


5 


0.61 ±0.17 


50 


Glyma07g36640 6 


^ Glyma15g 14860 


5 


0.59 ±0.21 


48 


Glyma07g36640 6 


StGlyma17g03950 


13 


0.23 ± 0.23 


19 


Glyma07g36640 6 


St Glyma09g03900 


5 


0.73 ±0.1 8 


60 


Glyma07g39250 6 


St Glyma09g00820 


5 


0.63 ±0.1 5 


52 


Glyma07g39250 6 


St Glyma15g 11680 


7 


0.65 ±0.1 8 


53 


Glyma07g39250 6 


StGlyma17g01490 


23 


0.1 7 ±0.11 


14 


Glyma08g23380 6 


StGlyma13g44730 


4 


0.49 ±0.1 8 


40 


Glyma08g23380 6 


StGlyma15g00570 


4 


0.52 ±0.1 7 


43 


Glyma08g26230 6 


StGlyma18g49830 


8 


0.21 ±0.08 


17 


Glyma08g43770 6 


StGlyma14g01980 


7 


0.56 ±0.1 3 


46 


Glyma09g00820 6 


StGlyma15g11680 


15 


0.20 ±0.14 


16 


Glyma09g00820 6 


StGlyma17g01490 


5 


0.67 ±0.21 


55 


Glyma09g03450 6 


St Glyma15g 14370 


14 


0.20 ±0.1 5 


16 


Glyma09g03900 6 


St Glyma15g 14860 


12 


0.25 ±0.21 


20 


Glyma09g03900 6 


StGlyma17g03950 


3 


0.59 ±0.1 7 


48 


Glyma09g06980 6 


StGlyma13g00380 


5 


0.69 ±0.1 8 


57 


Glyma09g06980 6 


StGlyma15g18250 


7 


0.22 ±0.21 


18 


Glyma09g06980 6 


StGlyma17g06450 


5 


0.56 ±0.11 


46 


Glyma09g39000 6 


StGlyma18g47350 


11 


0.1 9 ±0.10 


16 


Glyma09g39040 6 


StGlyma18g47300 


13 


0.1 8 ±0.10 


15 


Glyma09g41050 6 


StGlyma18g44560 


12 


0.1 6 ±0.06 


13 


Glyma10g01450 £ 


StGlyma19g40560 


6 


0.70 ±0.1 7 


57 


Glyma10g37460 6 


St Glyma20g30290 


11 


0.1 3 ±0.05 


11 


Glyma11g05650 6 


St Glyma17g 18480 


3 


0.71 ±0.23 


58 


Glyma12g10350 6 


StGlyma13g38630 


4 


0.73 ±0.10 


60 


Glyma12g33990 6 


StGlyma13g36540 


12 


0.23 ± 0.20 


19 


Glyma13g00380 £ 


StGlyma17g06450 


17 


0.1 9 ±0.1 3 


16 


Glyma13g17800 6 


StGlyma17g04710 


17 


0.20 ±0.1 7 


16 


Glyma13g44730 6 


StGlyma15g00570 


13 


0.1 2 ±0.06 


10 


Glyma14g11440 6 


StGlyma17g34210 


6 


0.32 ±0.1 9 


26 


Glyma14g11920 6 


StGlyma17g33920 


5 


0.1 7 ±0.05 


14 


Glyma15g11680 6 


StGlyma17g01490 


7 


0.67 ± 0.20 


55 


Glyma15g14860 6 


StGlyma17g03950 


5 


0.58 ± 0.24 


48 


Glyma15g18250 6 


StGlyma17g06450 


4 


0.68 ±0.1 6 


56 


Glyma16g02960 6 


StGlyma19g44380 


3 


0.58 ±0.1 7 


48 



Abbreviation: mya, million years ago. 
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study, phylogenetic analysis based on the relationship of 
different groups, indicated that domain gain and loss has 
indeed been a driving force in the expansion of the 
WRKY gene family. For example, subgroups II a, II b, 
and II c were phylogenetically closer to the C-terminal 
WRKY domain of group I. 

Glymal4g37960 and Glymal8g39970 were not assigned 
to any groups or subgroups. Glymal4g37960 has one 
WRKY domain; however, it is phylogenetically closer to 
group I N. Thus, Glymal4g37960 may have arisen from a 
two-domain WRKY protein that lost one of its WRKY do- 
mains during evolution, whereas in Glymal8g39970, a mu- 
tation in the sequence outside of the WRKY domain may 
have occurred before or after the domain loss. 

Transcriptome atlas, positive selection, and FDA 
of soybean WRKY proteins 

The transcriptome atlas revealed differential expression 
of the WRKY gene family under normal growth condi- 
tions. Furthermore, results of the promoter analysis were 
compatible with differential expression patterns. The ele- 
ments were distributed across three main functional 
categories, including biotic and abiotic stresses and 
developmental processes. Surprisingly, Skn-1 motif ele- 
ments, which are required for high levels of endosperm 
expression in cooperative interaction with other motifs 
(AACA, GCN4, ACGT) [77], were found to be abun- 
dant in all groups or subgroups. This result appears to 
contradict with the expression analysis, in which only one 
gene showed the highest transcript accumulation level in 
seed tissue. Since the function of c/s-acting elements is to 
regulate gene expression, we speculated that the reason for 
this phenomenon might be due to the deficiency of Skn-1 
motif element partners, AACA and ACGT elements, 
which were rarely detected in our study. On the other 
hand, according to the transcriptome atlas of the soybean 
WRKY gene family, the majority of GmWRKYs showed 
relatively reduced expression in seed development com- 
pared to other organs, which suggests that the expression 
of genes can be significantly affected when the Skn-1 motif 
lacks its partners. To further investigate the reason for this 
differential expression, we performed a positive selection 
analysis and a functional divergence analysis. 

We used both site models and branch-site models to 
detect positive selection. The results of site models indi- 
cated that one category of co did not fit the data well to 
describe the variability in selection pressures across 
amino acid sites. Therefore, the branch-site models, which 
allowed co ratios to vary among sites and lineages simul- 
taneously, appeared to be most suitable for describing 
evolutionary processes of the WRKY gene family. The 
branch-site analysis revealed that several sites were under 
positive selection. Along the group II e clade, the following 
sites were identified to be under positive selection: 248E, 



249Y, 286A, 288D, 298E, and 299G. Similar results were 
found in the group I, III, and II c clades. 

Figure 5 shows the locations of amino acid sites 
detected by PAML 4 in the 3D structure. Interestingly, 
with the exception of four amino acid sites (position 
258, 282, 293, and 294), sites in different groups or sub- 
groups were all located in the loop regions. Duan et al. 
[78] suggested that the DNA-binding ability of AtWRKY 
was mediated through the beta-hairpin regions between 
(32 and |33, and similar results were reported by Maeo 
et al. [79]. These results confirmed the theory proposed 
by Church et al. [80] for non-helical DNA binding. Fur- 
thermore, previous work on DNA binding of the WRKY 
family revealed that the conserved WRKYGQK region 
was important for DNA binding [79]. According to our 
results, the amino acid residues of bridging loops be- 
tween (3-strand regions may have been adapted for new 
functional roles during the process of evolution. 

Moreover, we further compared the number of WRKY 
genes in different groups or subgroups among Arabidopsis, 
rice, and soybean (Table 3). We observed that the number 
of members in different groups or subgroups was approxi- 
mately doubled in soybean than in corresponding groups 
or subgroups in Arabidopsis and rice, which can be attrib- 
uted to the more recent two genome duplication events in 
the soybean genome [43]. The key difference is that the 
number of group III members in soybean is roughly the 
same as that in Arabidopsis, but half of that in rice. This 
result may indirectly reflect the fact that group III in 
the dicotyledons may be confronted with strong positive 
Darwinian selection, whereas the evolution of this sub- 
group may be more conservative in the monocotyledons. 

Functional innovations include subfunctionalization [81], 
neofunctionalization [82], and subneofunctionalization 
[83]. Gene duplication may result in altered functional 
constraints between the gene clusters of a gene family. 
The results of the functional divergence analysis suggested 
that WRKY genes should be significantly functionally di- 
vergent from each other, especially with respect to the four 
amino acid residues (248E, 275R, 288D and 298E) identi- 
fied by both PAML 4 and DIVERGE 2.0 analyses, which 
were inferred to have played important roles during evolu- 
tion. On the other hand, functional divergence might re- 
flect the existence of long-term selective pressures. 

The soybean WRKY gene family arose mainly through 
segmental duplication 

The dramatic variation we observe in gene family size 
and distribution may have resulted owing to many pro- 
cesses, including tandem duplication with high rates of 
birth and death and gene duplication resulting from larger 
scale genome events such as polyploidy or duplications 
of chromosomal regions (here referred to as "segmental 
duplications"). 
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Figure 5 Model building of the 3D structure of the soybean WRKY protein {Glyma13g00380) based on similarity to the ^\tWRKY4-C 
domain (Protein Data Bank (PDB) code: 2lexA). The ensemble of the selected structures in stereo view (A), (B), (C), and (D) positive selection 
sites detected by the branch-site model presented in group I, group II c, group II e, and group III, respectively. The sites with red color indicate 
amino acid residues under statistically significant (p < 0.05) positive selection, as calculated by Bayes Empirical Bayes estimation methods. The 
presented region is Asp247-Pro306, excluding the N-terminal region. The figure was produced using the Swiss-model and pyMOL programs. 



The current investigation revealed the duplication pat- 
tern of the soybean WRKY gene family. One hundred 
two genes were found to evolve from segmental duplica- 
tion, suggesting that segmental duplication likely played 
a pivotal role in WRKY gene expansion in the soybean 
genome. The genome sequencing results revealed two 
genome duplication events in soybean, aging -13 mya 
and -57 mya [43], which is consistent with results of the 
present study. We inferred that the expansion of the 
WRKY gene family occurred along with genome duplica- 
tion events, and that these genes were retained during 
evolution. The structural similarity and variation be- 
tween genes located on the same chromosome and the 



phylogenetic analysis might help to explain the order 
of duplication events of the soybean WRKY genes on 
the same chromosome. For example, Glyma02g01420, 
Glyrna02gl2830, and Glyma02gl5920, which are located 
in different duplication blocks of the same chromosome, 
all have two introns flanked by three exons. However, 
phylogenetic analysis showed that Glyma02g01420 was 
more similar to Glymal0g01450, and Glyma02gl5920 
was more similar to Glymal0g03820, whereas Glyma 
02gl2830, which is located relatively close to Glyma02g 
15920, had no duplicate genes on chromosome 10. It is 
possible that the duplication of the same ancestral gene on 
chromosome 2 resulted in Glyma02gl2830 and the 



Table 3 Number of WRKY genes in Arabidopsis, rice, and soybean 

Groupl Group2a Group2b Group2c Group2d Group2e Group3 

/\fWRKY 13 4 7 18 7 9 14 

OsWRKY 15 4 8 15 7 11 36 

GmWRKY* 23 8 19 32 15 18 16 



Note: the WRKY proteins of soybean {Glycine max). 
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ancestor of Glymal0g014S0 and Glyrnal0g03820, which 
then evolved independently. The intron and exon se- 
quences of the ancestor gene might have elongated or 
shorten because of various reasons after it split into 
Glyma02g01420 and Glyma02glS920. Through segmen- 
tal duplication, the two chromosome segments, one 
contained in Glyma02gl2830 and the other contained 
in Glyma02g01420 and Glyma02gl5920, were inde- 
pendently copied to different parts of chromosome 
10. During subsequent evolution, the counterpart of 
Glyma02gl2830 was lost and structures for the counter- 
parts of Glyma02g01420 and Glyma02gl5920 changed 
by deletion or insertion of other fragments or partial se- 
quence repeat variations. Moreover, we noticed that 
both the Arabidopsis and rice genomes underwent re- 
cent duplication events, which also resulted in large- 
scale expansion of the WRKY gene family in their 
genomes. Therefore, we also examined the duplicated 
pattern of WRKY genes in these model species. 

The complete sequencing of the Arabidopsis genome 
revealed numerous large-scale segmental duplications 
[84]. Previous studies concluded that at least two rounds 
of duplications probably occurred in the Arabidopsis 
genome, with many losses and rearrangements leaving a 
mosaic of "segmental duplications" or "duplication blocks" 
[74,84]. Most duplication blocks appear to have originated 
from one round of polyploidy, as estimated by using vari- 
ous methods, that occurred 20-40 mya, before the evolu- 
tion of the genus Brassica but after the separation of 
Brassicaceae from other closely related eudicot families 
[74]. Results of the present study showed that no appar- 
ent tandem duplication events, and rare segmental dupli- 
cation events (six pairs), exist in the Arabidopsis WRKY 
gene family. Furthermore, the estimated time of the six 
pairs of segmental duplicated genes focus on the period 
of 24-27 mya (Additional file 9). Cannon at el. [72] found 
nine distinct pairs of duplicated segments and no tandem 
duplication events in the Arabidopsis WRKY family, 
which is compatible with our study. Comparison of six 
pairs of segmental duplicated genes in our study with the 
results of Blanc et al. [85] suggested that only one pair of 
genes {Atlgl3960 and At2g033400) originated from poly- 
ploidy in Arabidopsis. Consequently, we speculated that 
the other five segmental duplicated genes might have de- 
rived from independent segmental duplication events. The 
long period of time over which genome evolution has oc- 
curred has provided many opportunities for functional di- 
vergence in the genes that arose from duplications. Our 
results did not reveal evidence that other pairs of WRKY 
genes in Arabidopsis originated from duplicated blocks. 
Therefore, most of the Arabidopsis WRKY genes may have 
lost their paralogous genes after genome duplication [74]. 

With respect to rice, the expansion patterns of WRKY 
gene family have been clearly demonstrated. Ramamoorthy 



et al. [40] predicted 103 genes encoding WRKY transcrip- 
tion factors in rice, and the majority of rice WRKY genes 
(77.7%; 80 of 103) were detected on duplicated blocks. Of 
the WRKY genes, 45.6% (47 of 103) of WRKY genes were 
found to have corresponding coordinates generated by seg- 
mental duplications. Furthermore, 35.0% (36 of 103) of the 
WRKY genes were clustered together with a maximum of 
10 extra genes between them, and were regarded as 
tandemly duplicated genes. The results above were con- 
firmed by Jiang et al. [86]. That is, that both tandem and 
segmental duplication significantly contributed to the ex- 
pansion of the WRKY gene family in rice. 

All of the evidence suggests that the evolutionary pat- 
terns of the WRKY gene family differ between soybean, 
rice, and Arabidopsis. Species-specific expansion played 
an important role in the evolution of this family in 
plants. Segmental duplication appears to be the dominant 
mechanism for the generation of duplicated genes in soy- 
bean, whereas segmental duplication and tandem dupli- 
cation may play similar roles in the expansion of the rice 
WRKY gene family. Moreover, although Arabidopsis may 
have a tetraploid ancestor, the majority of its duplicated 
genes appear to have been lost throughout evolutionary 
processes. 

Conclusion 

Previous studies have demonstrated that members of the 
WRKY gene family play important roles in the regulation 
of several plant developmental processes and in responses 
to various biotic and abiotic stresses. Results of the present 
study indicate that segmental duplication has likely been 
the dominant mechanism of gene amplification during the 
expansion of the WRKY family in soybean. Furthermore, 
positive selection could be the main driving forces for the 
functional divergence of duplicated genes, which may have 
played a critical role in the responses of plants to various 
stresses throughout their evolutionary history. The results 
of this study will not only further our understanding of the 
evolutionary processes of soybean WRKY genes, but will 
also help to enhance functional genomics studies of WRKY 
transcription factors in an important model system. 

Methods 

Sequence collection 

Seventy WRKY protein sequences downloaded from 
AtTFDB (http://arabidopsis.med.ohio-state.edu/AtTFDB/) 
were used to BLAST against the soybean genome data- 
base, Phytozome v8.0 (http://www.phytozome.net /soy- 
bean), using the BLASTP program. Sequences were 
selected as candidate proteins if their E value was < le-10. 
For each query sequence, information of the location on 
chromosomes, genomic sequences, full coding sequences 
(CDS), and protein sequences were collected from 
Phytozome, and redundant genes were removed manually. 
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The Simple Modular Architecture Research Tool (SMART; 
http://smart.embl-heidelberg.de/) was used to confirm each 
predicted WRKY member. 

Phylogenetic tree construction and sequence analysis 

The SMART program was used to extract the protein 
sequences of the WRKY domain for each protein. Mul- 
tiple sequence alignment of domain sequences of 133 
WRKY family proteins from soybean and 70 protein se- 
quences from Arabidopsis was performed using the Clustal 
X 1.83 program with default parameters, and a phylogen- 
etic tree was generated and viewed using MEGA Version 
5.0. Exon and intron organizations of soybean WRKY 
genes were determined by comparing predicted CDS with 
their corresponding genomic sequences using GSDS 
(http://gsds.cbi.pku.edu.cn/) software. Motifs of paralogous 
WRKY proteins were identified statistically using MEME 
with default settings, except that the maximum number of 
motifs to find was set at 10. 

RNA-Seq atlas and promoter analysis 

RNA-Seq data were introduced to further analyze the 
expression of GmWRKY genes. Data was normalized using 
a variation of the reads/Kb/Million method, and Z-score 
analysis was obtained from SoyBase (http://soybase.org/ 
soyseq/) [44,87]. A heat map was generated using the 
GenePattern program (http://www.broadinstitute.org/ 
cancer/software/genepattern/index.html). The ds-acting 
elements that regulate gene expression are distributed in 
300-3000 bp upstream of the coding region, also take 
into consideration of sequence restriction in PlantCARE 
(http://bioinformatics.psb.ugent.be/webtools/plantcare/ 
html/) and the methods described by Liu et al. [68], 
therefore, 1500 bp upstream of the coding region were 
selected as promoter sequence and were downloaded 
from Phytozome (www.phytozome.net) and Soybean Func- 
tional Genomics Database (bioinformatics.cau.edu.cn). 
Then these sequences were submitted to PlantCARE for 
in silico analysis. 

Positive selection and functional divergence 

A maximum likelihood method in PAML was applied to 
test the hypothesis of positive selection in the WRKY 
gene family [63] under the site model and branch- site 
model. In the site model, two pairs of models were 
contrasted to test the selective pressures at codon sites. 
First, models MO (one ratio) and M3 (discrete) were 
compared, using a test for heterogeneity between codon 
sites in the dN/dS ratio value, co. The second comparison 
was M7 (beta) versus M8 (beta + co > 1). Meanwhile, we 
introduced the likelihood ratio test (LRT) to compare 
the two extreme models. When the LRT suggested posi- 
tive selection, the Bayes empirical Bayes (BEB) method 
was used to calculate the posterior probabilities that 



each codon was from the site class of positive selection 
under models M3 and M8. 

The branch-site model assumes that the co ratio varies 
between codon sites and that there are four site classes 
in the sequence. The first class of sites is highly con- 
served in all lineages with a small co ratio, co 0 . The sec- 
ond class includes neutral or weakly constrained sites 
for which co = CO!, where CO! is near or smaller than 1. In 
the third and fourth classes, the background lineages 
show co 0 or CO!, but foreground branches have co 2 , which 
may be greater than 1. When constructing the LRTs, the 
null hypothesis fixes co 2 = 1, allowing sites evolving under 
negative selection in the background lineages to be re- 
leased from constraint and to evolve neutrally on the 
foreground lineage; the alternative hypothesis constrains 
co 2 > 1 [64]. Posterior probabilities (Qks) were calculated 
using the BEB method [65]. 

The software DIVERGE was used to reveal the functional 
divergence between members of the WRKY protein family. 
The coefficients of Type-I and Type-II functional diver- 
gence (9-1 and 6-II, respectively) between any two clusters 
of interest were calculated. A 6-1 or 6-II significantly > 0 in- 
dicates site-specific altered selective constraints or a radical 
shift of amino acid physicochemical properties after gene 
duplication and/or speciation [66]. Moreover, Qk was used 
to predict critical amino acid residues that were responsible 
for functional divergence. In this study, we screened the 
codons (Qk > 0.8) as potential sites that were crucial for 
functional divergence. 

Analysis of WRKY gene expansion patterns and dating 
the duplication events 

In this study, we focused on two patterns of gene expan- 
sion: tandem duplication and segmental duplication. Tan- 
dem duplications were characterized as multiple members 
of this family occurring within the same or neighboring 
intergenic regions, where the WRKY genes were clustered 
together with a maximum of 10 extra genes between them 
[40]. Segmental duplications of each WRKY gene within 
the family in soybean and Arabidopsis genomes were 
searched in the PGDD (http://chibba.agtec.uga.edu/ 
duplication/). Within the range of 100 kb, the anchors 
with synonymous substitution rates (Ks) values greater 
than 1.0 were discarded because of the risk of satur- 
ation. Assuming a molecular clock, the Ks values of du- 
plicated genes are expected to be similar over time [88]. 
Therefore, we used Ks values to estimate the dates of 
the segmental duplication events. The mean Ks value 
was calculated for each pair of genes within a duplicated 
block and was then used to date the duplication events. 
The approximate date of the duplication event was 
then calculated using the mean Ks values (T = Ks/2A), 
assuming clock-like rates (X) of synonymous substitution 
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of 1.5 x 10 substitutions/synonymous site/year for 
Arabidopsis [89] and 6.1 x 10" 9 for soybean [90]. 

Additional files 



Additional file 1: WRKY gene family in soybean. 

Additional file 2: Schematic diagram of amino acid motifs of 
soybean WRKY proteins from different groups (or subgroups). Motif 
analysis was performed using MEME, as described in the Methods. The 
grey solid line represents the corresponding WRKY protein and its length. 
The different-colored boxes represent different motifs and their position 
in each WRKY sequence. A detailed motif introduction is shown in 
Additional file 3. 

Additional file 3: Schematic diagram of WRKY protein motifs. The 

schematic diagram was derived from MEME. The order of motifs of WRKY 
proteins in the diagram was automatically generated by MEME according 
to scores. 

Additional file 4: Exon/intron structures of soybean WRKY genes. 

The boxes and lines represent exons and introns, respectively. The bold, 
dark blue lines indicate the 3' downstream region. The WRKY genes were 
separated according to group or subgroup. 

Additional file 5: Promoter analysis of the soybean WRKY gene 
family. The locus names and c/'s-acting element names are listed. 

Additional file 6: Tests for positive selection among codons of 
WRKY genes using site models. 

Additional file 7: Maximum likelihood estimates of the coefficient 
of Type-I functional divergence (0) from pairwise comparisons 
between WRKY groups. Posterior probability (PP) of the site-specific 
Type-I functional divergence is provided. 

Additional file 8: Estimates of the coefficient of Type-ll functional 
divergence (6). 

Additional file 9: Estimates of the dates for the segmental 
duplication events of WRKY family in Arabidopsis. 
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